perf: profile baseline -- redirect from OP/blitter to GPU/DSP RISC #128

Merged: JoeMatt merged 3 commits into develop from feature/op-perf-baseline on May 2, 2026
@github-actions github-actions Bot commented May 1, 2026

Summary

Set up the OP perf branch per the plan, then profiled before writing any code. The profile data redirects the entire perf phase: OP isn't the bottleneck (1% on Iron Soldier, doesn't crack top-15 elsewhere); blitter isn't either except on the opt-in accurate path; GPU and DSP RISC interpretation dominate every ROM tested.

This PR is just the doc + the Makefile fix to make commercial ROM paths work in make benchmark. The headline finding lives in docs/op-perf-profile.md.

Profile (TL;DR)

| ROM | Hot subsystem | % of frame |
|---|---|---|
| yarc.j64 (demo) | GPUExec | ~74% |
| Iron Soldier (commercial) | DSPExec + opcodes | ~67% |
| Skyhammer (fast blitter) | GPUExec + DSPExec | ~57% combined |
| Skyhammer (accurate blitter) | BlitterMidsummer2 + DSP + GPU | 21% / 22% / 11% |
| OP across all ROMs | not in top-15, ~1% on Iron Soldier | n/a |
| 68K across all ROMs | m68k_execute 0.7-2.6% | n/a |

What this changes

Methodology

  • make benchmark BENCH_ROM=<rom> BENCH_BLITTER={fast,accurate} — 600 frames + 60 warmup, headless, no presentation
  • /usr/bin/sample <pid> 8 -file <out> over a 6000-frame benchmark run
  • 6 ROMs: yarc, jagniccc, Iron Soldier 1/2, Doom, Skyhammer
  • Apple Silicon, default release flags + RELEASE_DEBUG_INFO=1

Test plan

  • All 6 ROMs benchmark without errors after the BENCH_ROM quoting fix
  • Sample profiles capture meaningful symbols (built with -g; inlining did not hide the obvious targets)
  • Findings reproducible — same ROM, same blitter mode → same hot function within ±5%
  • CI green

Branch base

  • Base is develop (per GitFlow)

Captured baseline FPS + sample-based hot-function breakdown across
6 ROMs (yarc, jagniccc, Iron Soldier 1/2, Doom, Skyhammer) on
Apple Silicon.

Result contradicts the assumptions in spikes #123 (OP) and #124
(blitter):

  yarc.j64        -> GPUExec ~74% of frame time
  Iron Soldier    -> DSPExec ~67% of frame time
  Skyhammer fast  -> GPU+DSP ~57% of frame time, blitter <2%
  Skyhammer acc   -> blitter ~21%, but only on opt-in accurate
  OP              -> 1% on Iron Soldier, doesn't crack top-15 elsewhere
  68K             -> 0.7-2.6% everywhere; JIT not worth the licensing

The right next perf target is GPU/DSP RISC dynarec or cached IR
(half of issue #122).  Both share the same Tom RISC ISA, so a
single dispatcher attacks both.
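
As a rough illustration of the "cached IR" idea, the sketch below pre-decodes 16-bit RISC instruction words into a small struct once, so an interpreter loop could index decoded entries instead of extracting bit fields on every execution. The names `risc_insn`, `risc_decode`, and `risc_predecode` are hypothetical and do not come from this codebase. The field layout (6-bit opcode, two 5-bit operand fields) and the big-endian word order are assumptions about the RISC encoding, not something this PR states.

```c
#include <stddef.h>
#include <stdint.h>

/* One pre-decoded RISC instruction (hypothetical IR entry).
 * Assumed layout: opcode in bits 15..10, first operand field in
 * bits 9..5, second operand field in bits 4..0. */
typedef struct {
    uint8_t op;
    uint8_t r1;
    uint8_t r2;
} risc_insn;

/* Split one 16-bit instruction word into its three fields. */
static risc_insn risc_decode(uint16_t word)
{
    risc_insn d;
    d.op = (uint8_t)(word >> 10);
    d.r1 = (uint8_t)((word >> 5) & 0x1F);
    d.r2 = (uint8_t)(word & 0x1F);
    return d;
}

/* Decode n_insns big-endian instruction words from ram into out.
 * The caller provides out with room for n_insns entries. */
static void risc_predecode(const uint8_t *ram, size_t n_insns, risc_insn *out)
{
    size_t i;
    for (i = 0; i < n_insns; i++) {
        uint16_t word = (uint16_t)(((unsigned)ram[2 * i] << 8) | ram[2 * i + 1]);
        out[i] = risc_decode(word);
    }
}
```

Because the same decode step would serve both cores, one table of handlers indexed by `op` could in principle drive GPU and DSP execution. A real cache would also need to invalidate entries whenever the program RAM is written, which this sketch does not attempt.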

docs/op-perf-profile.md captures the methodology, the ROM-by-ROM
numbers, and the recommendation.

Drive-by: quote $(BENCH_ROM) in the Makefile benchmark recipe so
ROM paths with spaces / parens (every commercial ROM filename)
work.

Co-Authored-By: Claude Opus 4.7 <[email protected]>
JoeMatt changed the title from "Update from feature/op-perf-baseline" to "perf: profile baseline -- redirect from OP/blitter to GPU/DSP RISC" on May 1, 2026
Original profile only sampled headless boot/menu state, which doesn't
exercise the blitter much.  Added --load-state to test_benchmark
(handles RetroArch RASTATE container by extracting the MEM chunk),
loaded the user's AvP state6 save (active gameplay), and re-profiled.

Result: the spike reports' hypothesis about the blitter was right --
it just didn't show up in boot profiles.

  AvP gameplay, fast blitter:
    DSP ~63% (same as before), blitter 5%, GPU 4%, OP 1%
  AvP gameplay, accurate blitter:
    BlitterMidsummer2+ADDARRAY+DATA ~34% (matches docs/spike #124),
    DSP ~31%, GPU small

OP stays at ~1% even on gameplay -- closing #123 as wontfix-for-now is
still the honest call.

Updated recommendation: two genuine targets, suggested order:
  1. Accurate-blitter SIMD widening (#124) -- smaller surface, lower
     risk, immediate visible win for AvP-style accurate-blitter
     slowdown.  ADDARRAY (src/tom/blitter.c:2090-2094) is the obvious
     first target.
  2. RISC dynarec / cached IR (#122 RISC half) -- bigger lift, helps
     every game on every blitter mode.

Co-Authored-By: Claude Opus 4.7 <[email protected]>
github-actions Bot commented May 2, 2026

Regression: macos-arm64

Regression Test Results

| ROM | Status | Details | Diff |
|---|---|---|---|
| jagniccc | ✅ PASS | 0 pixels differ | - |
| yarc | ✅ PASS | 0 pixels differ | - |
| jagniccc (determinism) | ✅ PASS | identical across runs | - |
| yarc (determinism) | ✅ PASS | identical across runs | - |
| jagniccc (frameskip) | ✅ PASS | skip=0 matches skip=3 | - |
| yarc (frameskip) | ✅ PASS | skip=0 matches skip=3 | - |
| jagniccc (save state) | ✅ PASS | round-trip matches | - |
| yarc (save state) | ✅ PASS | round-trip matches | - |
| jagniccc (rewind) | ✅ PASS | rewind matches | - |
| yarc (rewind) | ✅ PASS | rewind matches | - |

Platform: Darwin arm64

Updated by CI at 2026-05-02T02:39:26.395Z

github-actions Bot commented May 2, 2026

Regression: linux-x64

Regression Test Results

| ROM | Status | Details | Diff |
|---|---|---|---|
| jagniccc | ✅ PASS | 0 pixels differ | - |
| yarc | ✅ PASS | 0 pixels differ | - |
| jagniccc (determinism) | ✅ PASS | identical across runs | - |
| yarc (determinism) | ✅ PASS | identical across runs | - |
| jagniccc (frameskip) | ✅ PASS | skip=0 matches skip=3 | - |
| yarc (frameskip) | ✅ PASS | skip=0 matches skip=3 | - |
| jagniccc (save state) | ✅ PASS | round-trip matches | - |
| yarc (save state) | ✅ PASS | round-trip matches | - |
| jagniccc (rewind) | ✅ PASS | rewind matches | - |
| yarc (rewind) | ✅ PASS | rewind matches | - |

Platform: Linux x86_64

Updated by CI at 2026-05-02T02:39:45.599Z

JoeMatt self-assigned this on May 2, 2026
JoeMatt marked this pull request as ready for review on May 2, 2026 00:46
JoeMatt self-requested a review as a code owner on May 2, 2026 00:46
Copilot AI review requested due to automatic review settings May 2, 2026 00:46
github-actions Bot commented May 2, 2026

Regression: linux-arm64

Regression Test Results

| ROM | Status | Details | Diff |
|---|---|---|---|
| jagniccc | ✅ PASS | 0 pixels differ | - |
| yarc | ✅ PASS | 0 pixels differ | - |
| jagniccc (determinism) | ✅ PASS | identical across runs | - |
| yarc (determinism) | ✅ PASS | identical across runs | - |
| jagniccc (frameskip) | ✅ PASS | skip=0 matches skip=3 | - |
| yarc (frameskip) | ✅ PASS | skip=0 matches skip=3 | - |
| jagniccc (save state) | ✅ PASS | round-trip matches | - |
| yarc (save state) | ✅ PASS | round-trip matches | - |
| jagniccc (rewind) | ✅ PASS | rewind matches | - |
| yarc (rewind) | ✅ PASS | rewind matches | - |

Platform: Linux aarch64

Updated by CI at 2026-05-02T02:39:56.368Z

Copilot AI left a comment

Pull request overview

Establishes a performance baseline for Virtual Jaguar’s OP/blitter vs GPU/DSP execution, and updates the benchmark harness so commercial ROM paths work reliably (and benchmarks can load a save state to profile gameplay rather than boot/menu idle).

Changes:

  • Add a new profiling write-up documenting measured hotspots and updated optimization priorities (docs/op-perf-profile.md).
  • Extend the headless benchmark tool with --load-state support (raw retro_serialize() payload or RetroArch RASTATE “MEM” chunk).
  • Quote $(BENCH_ROM) in make benchmark so paths with spaces work.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

| File | Description |
|---|---|
| test/tools/test_benchmark.c | Adds --load-state and RASTATE MEM-chunk extraction to allow profiling from in-game states. |
| docs/op-perf-profile.md | New documentation capturing the baseline/profile results and recommended perf focus. |
| Makefile | Quotes $(BENCH_ROM) in the benchmark target invocation to support ROM paths with spaces. |

Comment thread test/tools/test_benchmark.c Outdated
Comment on lines +383 to +398
    uint8_t *st_buf;
    const uint8_t *payload;
    size_t payload_size;
    size_t expected;
    fseek(stf, 0, SEEK_END);
    st_total = ftell(stf);
    fseek(stf, 0, SEEK_SET);
    st_buf = (uint8_t *)malloc(st_total);
    if (fread(st_buf, 1, st_total, stf) != (size_t)st_total)
    {
        fprintf(stderr, "ERROR: short read on state file\n");
        free(st_buf); fclose(stf); return 1;
    }
    fclose(stf);
    payload = st_buf;
    payload_size = (size_t)st_total;
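
The excerpt above uses the results of `fseek`, `ftell`, and `malloc` without checking them. A minimal sketch of a checked version follows, assuming a hypothetical helper `read_whole_file` that is not part of this PR. It routes every failure through one cleanup label so the `FILE` and the buffer are released on all paths; the merged fix may be structured differently.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read a whole file into a malloc'd buffer.
 * Returns 0 on success and hands the buffer to the caller.
 * Returns -1 on any failure, leaving *out_buf NULL and *out_size 0. */
static int read_whole_file(const char *path, uint8_t **out_buf, size_t *out_size)
{
    FILE *f = NULL;
    uint8_t *buf = NULL;
    long total;
    int rc = -1;

    *out_buf = NULL;
    *out_size = 0;

    f = fopen(path, "rb");
    if (!f) goto cleanup;
    if (fseek(f, 0, SEEK_END) != 0) goto cleanup;
    total = ftell(f);
    if (total <= 0) goto cleanup;   /* -1 signals an error; empty is unusable */
    if (fseek(f, 0, SEEK_SET) != 0) goto cleanup;

    buf = (uint8_t *)malloc((size_t)total);
    if (!buf) goto cleanup;
    if (fread(buf, 1, (size_t)total, f) != (size_t)total) goto cleanup;

    *out_buf = buf;
    *out_size = (size_t)total;
    buf = NULL;                     /* ownership moved to the caller */
    rc = 0;

cleanup:
    free(buf);                      /* free(NULL) is a no-op */
    if (f) fclose(f);
    return rc;
}
```

A caller would free the returned buffer once the state has been loaded, and would treat a `-1` return as a reason to tear down the core before exiting.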
Comment thread test/tools/test_benchmark.c Outdated
Comment on lines +405 to +417
    while (p + 8 <= end)
    {
        uint32_t chunk_size = (uint32_t)p[4] | ((uint32_t)p[5] << 8)
                            | ((uint32_t)p[6] << 16) | ((uint32_t)p[7] << 24);
        if (memcmp(p, "MEM ", 4) == 0)
        {
            payload = p + 8;
            payload_size = chunk_size;
            found = 1;
            break;
        }
        p += 8 + chunk_size;
    }
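
The loop above accepts `chunk_size` as read from the file, so a truncated or corrupt container could yield a payload that extends past the buffer. The sketch below shows one way to bound it, assuming a hypothetical helper `find_mem_chunk` that is not in this PR. It mirrors the chunk layout visible in the excerpt (4-byte tag, 4-byte little-endian size, payload) and advances the same way; whether real containers pad chunks is not visible here, so no padding is assumed.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: locate the "MEM " chunk in [p, end), with p <= end.
 * Returns 1 and sets the outputs only if the chunk's payload lies fully
 * inside the buffer; returns 0 if it is missing or the data is truncated. */
static int find_mem_chunk(const uint8_t *p, const uint8_t *end,
                          const uint8_t **payload, size_t *payload_size)
{
    /* Compare sizes rather than forming p + 8, which could point past end. */
    while ((size_t)(end - p) >= 8) {
        uint32_t chunk_size = (uint32_t)p[4] | ((uint32_t)p[5] << 8)
                            | ((uint32_t)p[6] << 16) | ((uint32_t)p[7] << 24);
        size_t remaining = (size_t)(end - p) - 8;

        if (chunk_size > remaining)
            return 0;               /* header claims more bytes than exist */
        if (memcmp(p, "MEM ", 4) == 0) {
            *payload = p + 8;
            *payload_size = chunk_size;
            return 1;
        }
        p += 8 + (size_t)chunk_size;
    }
    return 0;
}
```

With this shape, the size passed on to the state loader can never exceed what was actually read from disk, regardless of what the chunk header claims.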
Comment thread test/tools/test_benchmark.c Outdated
Comment on lines +375 to +380
    FILE *stf = fopen(state_load_path, "rb");
    if (!stf)
    {
        fprintf(stderr, "ERROR: cannot open state file: %s\n", state_load_path);
        return 1;
    }
Three concerns flagged by automated review on PR #128:

* Validate fseek/ftell/malloc/fread when reading the state file
  (negative size, NULL malloc, short read) instead of trusting them.
* Bounds-check RASTATE chunk_size against the buffer end before
  treating any bytes past the chunk header as payload, so a
  truncated/corrupt container can't push retro_unserialize past the
  allocation.
* Route every state-load failure through a single cleanup label that
  closes the FILE, frees the buffer, and tears down retro_load_game /
  retro_init / dlopen before returning, matching the cleanup in the
  normal exit path.

No functional change for valid state files.  Smoke-tested with
make benchmark (no --load-state) -- still reports FPS as before.

Co-Authored-By: Claude Opus 4.7 <[email protected]>
@JoeMatt JoeMatt merged commit 662b13f into develop May 2, 2026
30 checks passed